Entropy Transformer Networks: A Learning Approach via Tangent Bundle Data Manifold
This paper focuses on an accurate and fast interpolation approach for the image
transformations employed in CNN architectures. Standard Spatial Transformer
Networks (STNs) rely on bilinear or linear interpolation, whose unrealistic
assumptions about the underlying data distribution lead to poor performance
under scale variations. Moreover, STNs do not preserve the norm of gradients
during backpropagation because they depend only on sparse neighboring pixels.
To address these problems, a novel Entropy STN (ESTN) is proposed that
interpolates on the data manifold. In particular, random samples are generated
for each pixel in association with the tangent space of the data manifold, and
a linear approximation of their intensity values is constructed with an entropy
regularizer to compute the transformer parameters. A simple yet effective
technique is also proposed to normalize the non-zero values of the convolution
operation and to fine-tune the layers for regularization of the gradient norms
during training. Experiments on challenging benchmarks show that the proposed
ESTN improves predictive accuracy over a range of computer vision tasks,
including image reconstruction and classification, while reducing the
computational cost.
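The entropy-regularized interpolation idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sampling scheme, the softmax closed form for the entropy-regularized weights, and all parameter names (`n_samples`, `radius`, `beta`) are assumptions made for illustration.

```python
import numpy as np

def entropy_interpolate(image, x, y, n_samples=8, radius=1.0, beta=2.0, seed=0):
    """Interpolate the intensity at (x, y) from random neighborhood samples.

    The weights solve a linear approximation with an entropy regularizer;
    under a sum-to-one constraint the solution has a softmax closed form.
    """
    rng = np.random.default_rng(seed)
    # Random sample locations around the query point (a stand-in for
    # tangent-space samples on the data manifold).
    offsets = rng.uniform(-radius, radius, size=(n_samples, 2))
    pts = np.clip(np.round([x, y] + offsets).astype(int),
                  0, np.array(image.shape) - 1)
    d2 = np.sum(offsets ** 2, axis=1)
    # Entropy-regularized weights: nearer samples get larger weights,
    # and the entropy term keeps the weight distribution smooth.
    w = np.exp(-beta * d2)
    w /= w.sum()
    vals = image[pts[:, 0], pts[:, 1]]
    return float(np.dot(w, vals))
```

On a constant image every weighting scheme must return that constant, which gives a quick sanity check of the weights summing to one.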
Salient Skin Lesion Segmentation via Dilated Scale-Wise Feature Fusion Network
Skin lesion detection in dermoscopic images is essential for the accurate,
early, computer-aided diagnosis of skin cancer. Current skin lesion
segmentation approaches perform poorly in challenging circumstances such as
indistinct lesion boundaries, low contrast between the lesion and the
surrounding area, or heterogeneous backgrounds, all of which cause over- or
under-segmentation of the lesion. To accurately distinguish the lesion from
neighboring regions, we propose a dilated scale-wise feature fusion network
based on convolution factorization. Our network is designed to simultaneously
extract features at different scales, which are systematically fused for better
detection. Extensive lesion-segmentation experiments, including comparisons
with state-of-the-art models, show that the proposed model is both accurate and
efficient, consistently achieving state-of-the-art results.
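Scale-wise feature extraction with dilated convolutions followed by fusion can be sketched as follows. This is a toy single-channel NumPy version: the 3x3 kernel size, the dilation rates, and summation-based fusion are assumptions, not the paper's exact design.

```python
import numpy as np

def dilated_conv2d(x, k, dilation):
    """'Same'-padded 2-D convolution of x with a dilated 3x3 kernel k."""
    d = dilation
    xp = np.pad(x, d)  # for a 3x3 kernel, same-padding equals the dilation
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            # Each kernel tap reads a d-spaced shifted view of the input.
            out += k[i, j] * xp[i * d:i * d + x.shape[0],
                                j * d:j * d + x.shape[1]]
    return out

def scale_wise_fusion(x, kernels, dilations=(1, 2, 4)):
    """Extract features at several receptive-field scales and fuse by summation."""
    return sum(dilated_conv2d(x, k, d) for k, d in zip(kernels, dilations))
```

Increasing the dilation rate widens the receptive field without adding parameters, which is what lets the branches see the lesion at different scales before fusion.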
Efficient Object Detection in Optical Remote Sensing Imagery via Attention-based Feature Distillation
Efficient object detection methods have recently received great attention in
remote sensing. Although deep convolutional networks often have excellent
detection accuracy, their deployment on resource-limited edge devices is
difficult. Knowledge distillation (KD) is a strategy for addressing this issue
since it makes models lightweight while maintaining accuracy. However, existing
KD methods for object detection suffer from two limitations. First, they
distill only nearby foreground regions and discard potentially important
background information. Second, they rely only on the global context, which
limits the student detector's ability to acquire local information from the teacher
detector. To address the aforementioned challenges, we propose Attention-based
Feature Distillation (AFD), a new KD approach that distills both local and
global information from the teacher detector. To enhance local distillation, we
introduce a multi-instance attention mechanism that effectively distinguishes
between background and foreground elements. This approach prompts the student
detector to focus on the pertinent channels and pixels, as identified by the
teacher detector. Since local distillation alone lacks global information,
attention-based global distillation is also proposed to reconstruct the
relationships between pixels and transfer them from the teacher to the student
detector. AFD is evaluated on two public aerial-image benchmarks, and the
results demonstrate that it attains the detection performance of
state-of-the-art models while remaining efficient.
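A rough sketch of attention-guided local distillation follows. This is a hypothetical NumPy stand-in: the exact attention formulation, normalization, and loss weighting in AFD may differ from what is shown here.

```python
import numpy as np

def attention_masks(feat):
    """Spatial and channel attention masks from a C x H x W feature map."""
    spatial = np.abs(feat).mean(axis=0)       # H x W saliency over pixels
    channel = np.abs(feat).mean(axis=(1, 2))  # per-channel saliency
    # Softmax-normalize so the masks emphasize salient pixels / channels.
    spatial = np.exp(spatial) / np.exp(spatial).sum()
    channel = np.exp(channel) / np.exp(channel).sum()
    return spatial, channel

def local_distill_loss(f_teacher, f_student):
    """L2 feature-distillation loss weighted by the teacher's attention.

    The student is pushed hardest toward the teacher at the channels and
    pixels the teacher itself deems important, foreground or background.
    """
    s_mask, c_mask = attention_masks(f_teacher)
    w = c_mask[:, None, None] * s_mask[None, :, :]
    return float((w * (f_teacher - f_student) ** 2).sum())
```

Because the weights come from the teacher's own activations rather than ground-truth boxes, background regions with informative context still receive distillation signal.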
Hybrid Gromov-Wasserstein Embedding for Capsule Learning
Capsule networks (CapsNets) aim to parse images into a hierarchy of objects,
parts, and their relations using a two-step process involving part-whole
transformation and hierarchical component routing. However, this hierarchical
relationship modeling is computationally expensive, which has limited the wider
use of CapsNets despite their potential advantages. Current CapsNet models
primarily focus on comparing their performance with capsule baselines and fall
short of the proficiency of deep CNN variants on intricate tasks. To address
this limitation, we present an efficient
approach for learning capsules that surpasses canonical baseline models and
even demonstrates superior performance compared to high-performing convolution
models. Our contribution is twofold. First, we introduce a group of
subcapsules onto which an input vector is projected. Second,
we present the Hybrid Gromov-Wasserstein framework, which initially quantifies
the dissimilarity between the input and the components modeled by the
subcapsules, followed by determining their alignment degree through optimal
transport. This innovative mechanism capitalizes on new insights into defining
alignment between the input and subcapsules, based on the similarity of their
respective component distributions. This approach enhances CapsNets' capacity
to learn from intricate, high-dimensional data while retaining their
interpretability and hierarchical structure. Our proposed model offers two
distinct advantages: (i) its lightweight nature facilitates the application of
capsules to more intricate vision tasks, including object detection; (ii) it
outperforms baseline approaches on these demanding tasks.
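The alignment step can be illustrated with a plain entropic optimal-transport (Sinkhorn) sketch. Note that this is a generic stand-in, not the paper's hybrid Gromov-Wasserstein formulation; the cost matrix, the two distributions, and the parameters are all assumptions for illustration.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iter=200):
    """Entropic-regularized optimal transport via Sinkhorn iterations."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)  # alternately rescale to match the marginals
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]        # transport plan

def capsule_alignment(x, subcaps, eps=0.1):
    """Alignment degree between an input vector and a group of subcapsules."""
    proj = subcaps @ x                        # one projection per subcapsule
    p = np.exp(proj) / np.exp(proj).sum()     # input-side distribution
    q = np.ones(len(subcaps)) / len(subcaps)  # uniform subcapsule prior
    cost = 1.0 - np.eye(len(subcaps))         # dissimilarity: 0 for a match
    T = sinkhorn(cost, p, q, eps)
    return np.diag(T)                         # mass kept on each matching pair
```

The transport plan's diagonal serves here as a routing-like coefficient: the more mass stays on a subcapsule's own component, the stronger the alignment.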
Distance Weighted Trans Network for Image Completion
The challenge of image generation has been effectively modeled as a problem
of structure priors or transformation. However, existing models struggle to
understand the global structure of the input image because of inherent
architectural properties (for example, the local inductive bias of convolutions).
Recent studies have shown that self-attention is an efficient modeling
technique for image completion problems. In this paper, we propose a new
architecture that relies on Distance-based Weighted Transformer (DWT) to better
understand the relationships between an image's components. In our model, we
leverage the strengths of both Convolutional Neural Networks (CNNs) and DWT
blocks to enhance the image completion process. Specifically, CNNs are used to
augment the local texture information of coarse priors and DWT blocks are used
to recover certain coarse textures and coherent visual structures. Unlike
current approaches that generally use CNNs to create feature maps, we use the
DWT to encode global dependencies and compute distance-based weighted feature
maps, which substantially mitigates visual ambiguities.
Meanwhile, to better produce repeated textures, we introduce Residual Fast
Fourier Convolution (Res-FFC) blocks to combine the encoder's skip features
with the coarse features provided by our generator. Furthermore, a simple yet
effective technique is proposed to normalize the non-zero values of
convolutions and to fine-tune the network layers for regularization of the
gradient norms, which acts as an efficient training stabilizer. Extensive
quantitative and qualitative experiments on three challenging datasets
demonstrate the superiority of the proposed model over existing approaches.
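The distance-based weighting of attention scores can be sketched as below. This is a simplified single-head NumPy version; exactly how DWT injects spatial distances into the transformer (the penalty form and the temperature `tau`) is an assumption here.

```python
import numpy as np

def distance_weighted_attention(tokens, coords, tau=1.0):
    """Self-attention whose scores are down-weighted by spatial distance.

    tokens: N x D features; coords: N x 2 pixel positions.
    """
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)   # standard dot-product scores
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    scores = scores - dist / tau              # distant pixels contribute less
    # Row-wise softmax (shifted for numerical stability).
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ tokens
```

Subtracting a distance penalty before the softmax biases each pixel toward coherent nearby structure while still allowing strong long-range matches to dominate.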
Road Segmentation for Remote Sensing Images using Adversarial Spatial Pyramid Networks
Road extraction from remote sensing images is of great importance for a wide
range of applications. Because of complex backgrounds and high object density,
most existing methods fail to extract road networks that are both accurate and
complete. Moreover, they suffer from either insufficient training data or the
high cost of manual annotation. To address these problems, we
introduce a new model to apply structured domain adaptation for synthetic image
generation and road segmentation. We incorporate a feature pyramid network into
generative adversarial networks to minimize the difference between the source
and target domains. A generator is learned to produce high-quality synthetic
images, and the discriminator attempts to distinguish them from real ones. We
also propose a feature pyramid network that improves performance by extracting
effective features from all layers of the network to describe objects at
different scales. Specifically, a novel scale-wise architecture is introduced
to learn from the multi-level feature maps and improve the semantics of the
features. For optimization, the model is trained by a joint reconstruction loss
function, which minimizes the difference between the fake images and the real
ones. A wide range of experiments on three datasets prove the superior
performance of the proposed approach in terms of accuracy and efficiency. In
particular, our model achieves a state-of-the-art 78.86 IoU on the Massachusetts
dataset with 14.89M parameters and 86.78B FLOPs, using 4x fewer FLOPs while
achieving higher accuracy (+3.47% IoU) than the top performer among the
state-of-the-art approaches in the evaluation.
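One plausible form of the joint objective described above, pairing an adversarial term with a reconstruction term that minimizes the difference between fake and real images, can be sketched as follows (the non-saturating adversarial form and the weighting factor `lam` are assumptions, not the paper's exact loss):

```python
import numpy as np

def joint_loss(fake, real, d_fake, lam=10.0):
    """Generator objective: adversarial term plus an L1 reconstruction term.

    d_fake holds the discriminator's scores on the synthetic images.
    """
    adv = -np.mean(np.log(d_fake + 1e-8))  # push D(fake) toward 1
    rec = np.mean(np.abs(fake - real))     # reconstruction (L1) term
    return adv + lam * rec
```

The reconstruction term anchors the generator to the real images so the adversarial game cannot drift toward plausible-but-wrong road layouts.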